Information-Theory Interpretation of the Skip-Gram Negative-Sampling Objective Function

A dependency measure based on the Jensen-Shannon divergence

The authors define a dependency measure between two random variables that is based on the Jensen-Shannon divergence.

For distributions $p$ and $q$ over a discrete domain $A$, the Kullback-Leibler (KL) divergence of $p$ from $q$ is defined as follows:


$KL(p||q)=\sum_{i \in A} p_i \log \frac{p_i}{q_i}$
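
As a quick numerical illustration (the distributions below are made up, and the helper name `kl_divergence` is ours, not from the paper), the sum can be computed directly:

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i) for discrete distributions.

    Terms with p_i == 0 contribute 0 by convention.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(kl_divergence(p, q))  # small positive value; 0 only when p == q
print(kl_divergence(p, p))  # 0.0
```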

The mutual information between two jointly distributed random variables $X$ and $Y$ is defined as the KL divergence of the joint distribution $p(x,y)$ from the product $p(x)p(y)$ of the marginal distributions of $X$ and $Y$. This means:


$I(X;Y)=KL(p(x,y)||p(x)p(y))$
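
A small sketch of this definition, computing $I(X;Y)$ directly from an illustrative joint table (the numbers are arbitrary):

```python
import numpy as np

# Illustrative joint distribution p(x, y): rows index x, columns index y.
joint = np.array([[0.20, 0.10, 0.10],
                  [0.05, 0.25, 0.30]])

p_x = joint.sum(axis=1, keepdims=True)   # marginal p(x), shape (2, 1)
p_y = joint.sum(axis=0, keepdims=True)   # marginal p(y), shape (1, 3)
independent = p_x * p_y                  # product of marginals p(x)p(y)

# I(X; Y) = KL( p(x, y) || p(x)p(y) )
mask = joint > 0
mutual_information = np.sum(joint[mask] * np.log(joint[mask] / independent[mask]))
print(mutual_information)  # positive, since this X and Y are dependent
```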

Similarly to the way mutual information builds on the KL divergence, the authors base their measure on the (generalized) Jensen-Shannon divergence, a smoothed and symmetric variant of the KL divergence. For a weight $\alpha \in (0,1)$ it is defined as:


$$JS_\alpha (p,q)= \alpha KL(p||r) + (1-\alpha) KL(q||r) = H(r)-\alpha H(p) - (1-\alpha) H(q)$$

where $r = \alpha p + (1-\alpha) q$ is the $\alpha$-weighted mixture of $p$ and $q$, and $H$ is the entropy function (i.e., $H(p)=-\sum_i p_i \log p_i$).
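
The two forms of $JS_\alpha$ above can be checked numerically; the sketch below uses arbitrary distributions and a weight $\alpha$ chosen only for illustration:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def kl(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.1, 0.6, 0.3])
alpha = 0.3
r = alpha * p + (1 - alpha) * q          # the alpha-weighted mixture

weighted_kl_form = alpha * kl(p, r) + (1 - alpha) * kl(q, r)
entropy_form = entropy(r) - alpha * entropy(p) - (1 - alpha) * entropy(q)
print(weighted_kl_form, entropy_form)    # the two forms agree up to rounding
```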

They define the Jensen-Shannon Mutual Information (JSMI) as follows:

$JSMI_\alpha (X,Y) = JS_\alpha (p(x,y),p(x)p(y))$
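
Combining the pieces, here is a minimal sketch of $JSMI_\alpha$ on a toy joint distribution (the table, the weight, and the helper names are illustrative, not the authors' code):

```python
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, q, alpha):
    """Generalized Jensen-Shannon divergence with mixture r = alpha*p + (1-alpha)*q."""
    r = alpha * np.asarray(p, dtype=float) + (1 - alpha) * np.asarray(q, dtype=float)
    return alpha * kl(p, r) + (1 - alpha) * kl(q, r)

def jsmi(joint, alpha):
    """JSMI_alpha(X, Y) = JS_alpha( p(x,y), p(x)p(y) ), treating both tables as vectors."""
    joint = np.asarray(joint, dtype=float)
    independent = joint.sum(axis=1, keepdims=True) * joint.sum(axis=0, keepdims=True)
    return js(joint.ravel(), independent.ravel(), alpha)

# Illustrative joint distribution.
joint = np.array([[0.20, 0.10, 0.10],
                  [0.05, 0.25, 0.30]])
print(jsmi(joint, alpha=0.5))  # 0 iff X and Y are independent, positive otherwise
```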

In their experiments, the embeddings' score approaches the optimal value as the dimensionality and the number of training iterations grow, but does not surpass it. They showed that optimizing skip-gram embeddings with negative sampling finds the best low-dimensional approximation of the JSMI measure.
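
For reference, the skip-gram negative-sampling objective being analyzed scores each observed (word, context) pair against $k$ sampled negative contexts. The sketch below evaluates that objective for given embedding matrices; the toy vocabulary, embeddings, pairs, and uniform noise distribution are placeholders, not the authors' setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_objective(W, C, pairs, noise_dist, k=5):
    """Skip-gram negative-sampling objective for observed (word, context) index pairs.

    Each pair contributes log sigmoid(w . c) plus k terms log sigmoid(-w . c_neg),
    with negative contexts sampled from noise_dist.
    """
    total = 0.0
    for w_idx, c_idx in pairs:
        w = W[w_idx]
        total += np.log(sigmoid(w @ C[c_idx]))
        negatives = rng.choice(len(C), size=k, p=noise_dist)
        total += np.sum(np.log(sigmoid(-(C[negatives] @ w))))
    return total

# Toy setup: 10 words/contexts, 8-dimensional embeddings, a few made-up pairs.
V, d = 10, 8
W = rng.normal(scale=0.1, size=(V, d))   # word embeddings
C = rng.normal(scale=0.1, size=(V, d))   # context embeddings
pairs = [(0, 1), (2, 3), (0, 4)]
noise = np.full(V, 1.0 / V)              # uniform noise distribution, for illustration only
print(sgns_objective(W, C, pairs, noise))
```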
